Detecting OOV Named-Entities in Conversational Speech
Authors
Abstract
A common cause of errors in spoken language systems is the presence of out-of-vocabulary (OOV) words in the input. Named entities (people, places, organizations, etc.) are a particularly important class of OOVs. In this paper we focus on detecting OOV named entities (NEs) for two-way English/Iraqi speech-to-speech translation. Our approach builds on a Maximum Entropy (MaxEnt) classifier trained on a suite of contextual features. These features include n-gram context, part-of-speech tags (both supervised and unsupervised), and word posterior features computed from the trajectory of the word posteriors within the utterance. Our experimental results show that fusion (both early and late) of these novel word posterior features with the rest of the contextual features significantly improves detection accuracy for OOV NEs. However, we also observe that the same features that perform well on OOV NEs can hurt the detection of in-vocabulary NEs. Therefore, the choice of features should be based on the expected occurrence of OOV NEs.
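The abstract does not say how the word posterior features or the two fusion schemes are defined, so the following is only a minimal sketch under stated assumptions. The function names, the window size, the particular statistics (context minimum, context mean, deviation from the utterance mean), and the linear interpolation used for late fusion are all hypothetical choices made for illustration, not the authors' method.

```python
from statistics import mean


def posterior_trajectory_features(posteriors, i, window=1):
    """Hypothetical features for word i, derived from the ASR word
    posteriors of the whole utterance (the 'trajectory')."""
    lo = max(0, i - window)
    hi = min(len(posteriors), i + window + 1)
    context = posteriors[lo:hi]
    return {
        "post": posteriors[i],                           # posterior of the word itself
        "ctx_min": min(context),                         # lowest posterior in the local window
        "ctx_mean": mean(context),                       # average posterior in the local window
        "delta_utt": posteriors[i] - mean(posteriors),   # deviation from the utterance average
    }


def early_fusion(context_feats, posterior_feats):
    """Early fusion (assumed form): merge both feature sets into one
    feature dict that a single classifier would be trained on."""
    fused = dict(context_feats)
    fused.update(posterior_feats)
    return fused


def late_fusion(p_context, p_posterior, weight=0.5):
    """Late fusion (assumed form): interpolate the OOV-NE probabilities
    produced by two separately trained classifiers."""
    return weight * p_context + (1.0 - weight) * p_posterior


# Toy utterance of five words; the third word has a low posterior.
posteriors = [0.9, 0.8, 0.3, 0.4, 0.9]
feats = posterior_trajectory_features(posteriors, 2)
fused = early_fusion({"prev_word=mister": 1.0, "pos=NNP": 1.0}, feats)
score = late_fusion(p_context=0.2, p_posterior=0.8, weight=0.5)
```

In this sketch, early fusion changes what a single classifier sees, while late fusion leaves the two classifiers independent and only combines their outputs; the interpolation weight would have to be tuned on held-out data.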
Similar resources
OOV Sensitive Named-Entity Recognition in Speech
Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named e...
Source-Error Aware Phrase-Based Decoding for Robust Conversational Spoken Language Translation
Spoken language translation (SLT) systems typically follow a pipeline architecture, in which the best automatic speech recognition (ASR) hypothesis of an input utterance is fed into a statistical machine translation (SMT) system. Conversational speech often generates unrecoverable ASR errors owing to its rich vocabulary (e.g. out-of-vocabulary (OOV) named entities). In this paper, we study the ...
Variable-Span out-of-vocabulary named entity detection
Out-of-vocabulary named entities (OOV NEs) are always misrecognized by fixed-vocabulary automatic speech recognition (ASR) systems. This has a negative impact on downstream applications such as language understanding and machine translation (MT). Automatic detection of OOV NEs in ASR hypotheses can help mitigate this problem by triggering the use of alternative approaches to acquire and process...
Named entity tagged language models
We introduce Named Entity (NE) Language Modelling, a stochastic finite state machine approach to identifying both words and NE categories from a stream of spoken data. We provide an overview of our approach to NE tagged language model (LM) generation together with results of the application of such an LM to the task of out-of-vocabulary (OOV) word reduction in large vocabulary speech recognition. ...
THE JOHNS HOPKINS UNIVERSITY Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. We present a novel probabilistic model to l...